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ABSTRACT 

Existence of some extra-genetic (epigenetic) codes 
has been postulated since the discovery of the 
primary genetic code. Evident effects of histone 
post-translational modifications or DNA methylation 
over the efficiency and the regulation of DNA proces- 
ses are supporting this postulation. EMdeCODE is an 
original algorithm that approximate the genomic dis- 
tribution of given DNA features (e.g. promoter, 
enhancer, viral integration) by identifying relevant 
ChlPSeq profiles of post-translational histone 
marks or DNA binding proteins and combining 
them in a supermark. EMdeCODE kernel is essen- 
tially a two-step procedure: (i) an expectation- 
maximization process calculates the mixture of 
epigenetic factors that maximize the Sensitivity 
(recall) of the association with the feature under 
study; (ii) the approximated density is then recur- 
sively trimmed with respect to a control dataset to 
increase the precision by reducing the number of 
false positives. EMdeCODE densities improve signifi- 
cantly the prediction of enhancer loci and retroviral 
integration sites with respect to previous methods. 
Importantly, it can also be used to extract distinctive 
factors between two arbitrary conditions. Indeed 
EMdeCODE identifies unexpected epigenetic profiles 
specific for coding versus non-coding RNA, pointing 
towards a new role for H3R2me1 in coding regions. 

INTRODUCTION 

The role of epigenetic mechanisms as modulators of tran- 
scriptional activation and repression, DNA methylation, 
cell memory maintenance, stem cell differentiation. 



cancerogenesis and other major genomic processes have 
been widely demonstrated [reviewed in (1-3)]. Moreover, 
it has been already observed that epigenetic features such 
as histone post-translational modifications and DNA 
methylation can act in concert to promote or silence 
specific cellular functions (4). Accordingly, I have 
recently shown that retroviral insertional site selection 
into the host DNA could be strongly influenced by a set 
of specific histone modifications acting as a sort of beacon 
for incoming retroviruses (5). This information could be 
taken as the prelude to postulate the existence of a 
not-yet-well-defined epigenetic code (6). 

Here I propose a statistical approach to identify patterns 
of epigenetic signatures associated with specific genomic 
functions. Relatively new technologies, notably ChlPSeq, 
are able to track the position of DNA binding proteins as 
transcription factors or histones carrying specific modifica- 
tions and yield a genome-wide density profile. When search- 
ing for association to a given genomic process, this 
information might be probabilistically interpreted and 
combined to generate a new virtual mark of greater statis- 
tical power with respect to the 'real' mark alone. Several 
approaches have been developed. Some (7-10) are based 
on unsupervised genome segmentation where the aim is to 
find clusters of epigenetic factors repeated over the genome 
(transcription factor binding sites, histone marks, DNase 
profile etc.). These patterns are then associated 'a posteriori' 
with known annotations as promoter or enhancer regions. 
As another example, (11) uses Bayesian networks to explores 
causal relationships among clusters of DNA binding 
proteins in Drosophila and infers direct or indirect inter- 
actions among them. In general, unsupervised methods are 
useful to label large portions of the genome but they lack a 
false-positive correction resulting in poor performance espe- 
cially in terms of positive predicted value (PPV or Precision 
<0.5) (10). On the other hand, supervised methods use 
training subsets of specific functional regions (enhancers 
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and promoters in most cases) to predict the genomic position 
of other unknown functional sites (12,13), and they use a 
control (random) dataset to increase the Precision. Addi- 
tionally these methods yield better Sensitivity (Recall) than 
unsupervised methods (12). Previously 1 implemented a 
supervised heuristic methodology based on F score statistics 
to create a 'supermarker' out of ChlPSeq profiles (5). This 
methodology requires the pre-calculation of individual F 
scores for each mark, the selection of a subset of significant 
marks and the eventual addition or deletion of some marks 
to the subset. EMdeCODE is a general extension of this idea, 
based on mixture modeling Expectation and Maximization 
(EM) and statistical selection. It is specifically designed to 
deal with post-translational histone modifications and it 
presents some peculiar features. The input data is a set of 
peaks calculated by an independent peak caller from row 
ChlPSeq data [similarly to (9)]. The rationale behind this 
design is to decouple the peak calling task that can be ac- 
complished by several continuously improved algorithms 
[see (14) for a review] from the statistical combination of 
histone marks. The advantage is 2-fold: (i) EMdeCODE is 
independent from current and likely soon obsolete treat- 
ments of ChlPSeq data; (ii) EMdeCODE can work with 
peaks called from different algorithms and from different 
experiments. The peak caller can be chosen accordingly 
with the expected ChlPSeq profile and the experimental con- 
ditions, for example HOMER (1 5) for narrow peaked marks 
as H3K4me3, and RSEG (16) for wider distribution as 
H3K27me3. Differently from all other approaches, for 
each input peak set, EMdeCODE recreates a new histone 
mark distribution by interpreting each peak as a single or 
stretch of nucleosonies (accordingly to the genomic size of 
the peak) and modeling the nucleosome occupancy by 
Gaussian functions of equal amplitude and variance 
( = 200 bp). The idea here is to drastically reduce the experi- 
mental noise and facilitate the interaction of data obtained 
by different experiments. These new distributions are then 
combined during the training phase to generate a genome- 
wide probability mass distribution that approximate the 
unknown functional distribution. Here, another peculiarity 
of EMdeCODE is the maximization of the F Score, a 
weighted combination of precision (positive predictive 
value) and recall (Sensitivity), to reduce the influence of 
false negatives similarly to what presented in (5). The main 
difference is the addition of the EM procedure that combines 
a set of factors and selects the relevant subset without super- 
vision. As shown in the Results section, a consequence of this 
approach is a gain of precision with respect to other 
methods. EMdeCODE can therefore be used to discover 
new associations between marks and genomic functions. 
Applications to enhancer identification in comparison with 
previous supervised methods, retroviral integration site 
prediction and to discriminate between coding versus 
non-coding genes are discussed. 

MATERIALS AND METHODS 

Input data 

High-Throughput Sequencing-based applications usually 
require large amount of computational power and 



software tools. The minimal computational ChlPSeq 
pipeline consists of a sequence aligner that maps millions 
of reads produced by the sequencer to the proper reference 
genome and a peak finder to identify enriched regions with 
respect to a control background according to some user 
pre-defined parameters [see (17) for a review about 
ChlPSeq and other similar technologies]. EMdeCODE 
takes as input the final peak set and assumes that these 
steps are correctly performed with proper controls and 
sufficient profile quahty. In particular, aU ChlPSeq 
histone profiles used in this study have been pre-processed 
with Fseq (18) for it produces good-quahty peak set from 
both narrow and broad read distributions (14). 

The algorithm 

Here M DNA-binding factors are considered. Formally, 
each ChlPSeq profile for factor Xj could be interpreted as 
the probabihty to find Xj bound to a specific genomic 
region of size w centered on i = ( clir, pos ) : 
pxj = p{Xj = /). This mass density is modeled by 
EMdeCODE with a sum of Gaussian functions, each 
one centered on one nucleosome. Nucleosome positions 
are extracted from ChlPSeq peaks by considering a 
genomic occupancy of 200 bp. This reduces the potential 
bias that can arise by combining ChlPSeq densities 
obtained over different experimental conditions. Briefly, 
each marker probability density function is written as 

1 X— \ -I'-p)" 

p{X=i) = -Y,e-^, (1) 

where F is the peak set of factor Z, and C is a normaliza- 
tion factor. 

Similarly, the biological event could be modeled as a 
random variable B taking values on all chromosomal 
loci. In other words, the event modeled by B has an 
unknown (mass) probabihty to occur in the region 
centered on /.■ pb= p{B — i) . 

The aim is to find a new mass probability functional 
that can approximate ps by means of the px-^'- 
Pb ^ ':^[px^,Px2,-,Pm\ 

Assuming that B has been observed to occur on loci 
{/ci,...,A>} , Pb can be written as 

PB = p{B = k) = = k\Xj = i)p{Xj = i) 

i 

^p{B^k\Xj^k)p{Xj^k)+ Y^p{B^k\Xj^i)p{Xj^i) 

(2) 

where ; spans across all possible loci. 

The first right term in (2) can be interpreted as how well 
pXj fits pb. The second term is the related error expressing 
the spreading of px. over loci unrelated to B. 

Summing (2) over /: 

^ (3) 
^^ p{B=k\Xj=i) ,y - ^ 

J ¥k 
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where pB\e = Ylf <^iPi^i)^ ® — («1v>o!m) and e is the ag- 
gregate of all M approximation errors. 

0 can be estimated by a classical Maximum Likelihood 
approach. Indeed, maximizing the (log)hkelihood of pB\e)'- 



M 



log(L(0|fi)) = log(^^|(,) = \o%\^ajp{X^y 

is equivalent to a mixture-density parameter estimation 
problem and can be efficiently treated with the Expec- 
tation Maximization algorithm (EM) (19). 

This algorithm works in a way that, at the t-esim 
iteration, the expectation function g(0,0*'"'') = Ey 
[logL(0|S,y)|_6,0*'"''] is maximized {Y is an auxihary 
random variable), that is, 0*'' = argniax^ g(©, ©('"''). 
As previously derived in (19), by introducing the probabil- 
ity that value k arises from the j-esim distribution. 



P{j\k,@) 



M 



(4) 



^ aip(Xi = k) 



(5) 



2(0,0^' '') can be written as 
G(0,0<''") = ^^log(aX/l^,0<'-'>) 

J k 

The mixture coefficient vector 0 is evaluated by 
maximizing (5) by the Lagrange multiplier a with con- 
straints ^, 0!/ — 1 and (4): 

«r' = - v w,0<'-') = v ^ 



N ^ p{B ^ k\<SP-^)) 



1 c^.'-VX; = A-) 



(6) 



N 



where N is the number of discrete genomic loci all 
densities have support. 

From (3) the following identity holds: 

a^;~'^p{Xj ^k)^ j^p(B = k,Xj = A-|0('-") 
Introducing the concept of Sensitivity (RecaU), defined 



as 



P(B,X) 
P(B) 



equation (6) can be simply written as: 



„(') 



l-PiB\X„&^^-'^) = l-R0'~^y) 



In other words, each factor is weighted proportionally to 
the respective Sensitivity. 

The spreading error e can be interpreted as the affinity 
of factor Mj for a control dataset C, defined as the set of 
random loci where p{B = c) — 0,Vc e C. 

Indeed, considering a control dataset with Nc control 
loci, the spreading error could be estimated as: 



^p{C=i\Xj=i) 



j i 



M 



piXj = i) 



MN, 



where fpj — NcP{C,Xj) is the number of false positives, 
that is, the number of control loci localizing in the 
region where factor Xj is bound. Therefore, minimizing £ 
impHes the minimization of false positives. 

The strategy adopted here to approximate /i^ is a two- 
step procedure where pb\b is calculated by (6) and low- 
probability peaks of pb\b are trimmed out to reduce the 
number of false positives (the whole procedure is sketched 
in Figure 1). Obviously this could also reduce the Sensiti- 
vity, that is, the number of peaks associated to B. To 
measure the quality of the approximation, the Fp score is 
evaluated after trimming to find the optimal tradeoff 
between Sensitivity and false positives reduction (Precision). 

Formally, the Fp score is defined as the P-weighted 
harmonic mean of Precision(P) and Sensitivity(R): 

Here, (3 = 0.5 to give more weight to Precision than to 
Sensitivity. This balances type I and type II errors by ad- 
justing for the high rate of False Positives (f 'p) inherent to 
the examination of large datasets for genome-wide binding 
sites according to statistical significance. Moreover, to 
make the F score insensitive to data size, the number of 
false positives has been normalized with respect to N [F 
score-based statistics and comparison with other 
measures have been extensively discussed in (5)]. 

Eventually, the algorithm is fined out as follows: 

(1) Pb\b is evaluated by (6). This corresponds to weighting 
each factor distribution with the respective Sensitivity. 

(2) pb\b is trimmed to reduce the number of false posi- 
tives by removing smaU peaks until F0.5 score is 
maximized. Accordingly, the support of mass prob- 
ability functions p{Xj — i) is reduced. 

(3) If F0.5 score has increased, step 1) is repeated or else 
it ends. 

EMdeCODE has been implemented in Python and can be 
downloaded from http://seaseq.unige.ch/~fsantoni/ 
EMdeCODE. 

RESULTS 

Comparison with previous methods 
Cross validation 

To compare the performance of EMdeCODE with previ- 
ously pubfislied algorithms, a 5-fold cross vafidation has 
been implemented over a set of 74 well-defined enhancers 
according to the experimental settings proposed in (20), 
but replacing ChlP-ChIP signals for H3K4me3 and 
H3K4mel with the corresponding ChlPSeq profiles (21). 
As background, 740 random genomic sites have been se- 
lected. EMdeCODE scored a PPV of 96.9% ±1.1 with an 
optimal window of 300 bp. Notably, this window is con- 
siderably smaller than 2000 bp reported elsewhere 
(12,13,20), probably owing to the higher resolution 
provided by ChlPSeq data. Comparison with previous 
methods is reported in Table 1. 

Enhancer identification 

EMDdeCODE has also been used to reconstruct the p300 
binding sites distribution in CD4+ T cells, as a strong 
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1. Mark Density Reconstruction 



In put Peaks 
(from a Peak Caller) 



Nucleosome 



position 




M ark Density 
Reconstruction 



2. EM for Maximizing Sensitivity 



Training Set 




■ Markl+Mark2 



3. Optimization for False Positives 



Iterative Threshold 




Supermark 




Figure 1. (1) EMdeCODE reconstructs mark density distributions by placing tlie appropriate number of nucleosomes into each peak region and 
modeling nucleosome occupancy by a Gaussian distribution with variance 200 bp. (2) Expectation Maximization evaluates the sensitivity of each 
mark to true positives (filled round dots). Marks are weighted accordingly and linearly combined to generate the raw supermark. (3) Peaks of the 
supermark accounting for random loci (crosses: false positives) are trimmed off the supermark to optimize the F score. 



marker for enhancer identification (22). Thirty-nine 
ChlPSeq histone modifications (acetylation and methyla- 
tion) profiles have been used as input (see 'Materials and 
Methods' section). Similarly to the analysis proposed by 
Firpi et al. (12), EMdeCODE was trained with 213 p300 
distal binding sites [at least 2.5 Kbp from a known 
Transcription Starting Site (TSS)] overlapping PreMod 
(23) predicted enhancers and 2130 random genomic 
sites. Here, two loci are considered as overlapping if 
their genomic distance is <2 Kbp. 



EMdeCODE generated 22 878 whole genome putative 
enhancer sites, 53% overlapping (12165) with the 36 796 
sites generated by CSI-ANN [(12), data in Supplementary 
Material]. To vahdate this prediction, the EMdeCODE 
generated enhancer dataset has been compared with the 
whole genome distal p300-enriched regions [>2.5Kbp 
from TSS, calculated by SICER (24)]. In all, 74% (6069/ 
8245) of p300 regions overlap with EMdeCODE predic- 
tions, 29% more than CSI-ANN (45%, 3740/8245). When 
compared with a background dataset to evaluate the 
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number of false positives (70 000 random selected genomic 
sites), EMdeCODE predictions obtained an F score of 
0.91 (Precision = PPV = 96%, Recall = Sensitivity = 
74%), whereas the CSI-ANN F score was 0.74 
(Precision = PPV = 89%, Recall = Sensitivity = 45%). 
This is somehow expected because EMdeCODE aims to 
maximize the F score. A visual comparison between 
EMdeCODE and CSI-ANN is reported in Figure 2 by 
Chromosome Projection Mandalas (5). It is interesting 
to observe that, consequently, EMdeCODE improves 
both PPV and Sensitivity. Previously it has been shown 
that enhancers are enriched with DNase I sensitive sites 
(13,22,25). Indeed, 80% of EMdeCODE predictions is 
supported by at least one among p300, DNase sites (26) 
and PreMod predicted functional sites, 20% more than 
CSI-ANN (60%). As additional comparison, 49 746 
p300 distal binding sites were obtained by peak calhng 
from the original p300 ChlPSeq data. In all, 18 759 
(82%) EMdeCODE-generated sites were supported by 
p300 peaks, 45% more than CSI-ANN (37%). Together, 
these results indicate that EMdeCODE is a more effective 
algorithm to approximate the p300 distribution and a 
superior enhancer predictor. 

Enhancer 'Code' 

The composition of the enhancer 'code' generated by 
EMdeCODE (reported in Supplementary Figure SI A) is 
dominated by H3K36ac, an highly conserved histone 
modification already reported to be strongly enriched at 
enhancer sites (27,28) followed by H2BK120 and H3K4ac, 



Table 1. Positive Predicted Value comparison among enhancer 
predicting algorithms in a 5-fold cross-validation test 



Method (H3K4mel, H3K4me3) 


Enhancer PPV(±SD)" 


EMdeCODE 


96.9% ±1.10 


CSI-ANN (12) 


96.22% ± 2.14 


HMM (20) 


94.1% ± 0.89 


Heintzmann et al. (13) 


85% 



"Where available. 



two marks associated with active enhancers and pro- 
moters (29). Notably, EMdeCODE enhancers are also 
enriched in H3K4me3 (76%) and H3K27ac (70%) and 
depleted in H3K27me3 (<1%), indicating that most of 
them are likely active enhancers (30,31). 

Interestingly, this particular composition depends 
on modehng ChlP-Seq profiles as probability mass 
distributions and specifically on the normalization 
fZZpix)dx=\. When min-max normalization is applied 

(i.e. C = max 2]^gp e ^-'^ m eq. 0), EMdeCODE generates 
45 616 putative enhancers, 16961 (out of 22 878, 75%) 
overlapping with those ones produced by standard nor- 
malization. In this case the composition of this supermark 
is dominated by the well known marks H2AZ, H3K9mel, 
H3K4mel and H3K4me3 (Supplementary Figure SIB). 
Notably, the performance of this prediction is slightly 
better in Sensitivhy (77%, 6374/8244 of p300 regions 
overlap with this prediction) but not in Precision (89%), 
leading to a lower F score = 0.86. For this reason, the 
standard normalization has been preferred in 
EMdeCODE implementation. 

Retroviral integration site prediction 

EMdeCODE has been used to generate new virtual marks 
by combining 39 different histone marks previously ex- 
tracted from CD4+ T cells [methylation (32) and acetyl- 
ations (29)] with 6 histone marks from HeLa cells (21). 
These new marks have been tested for association with 
several gammaretroviral integration datasets with respect 
to randomly generated matched control datasets [similarly 
to (5)]. 

Indeed, EMdeCODE improved the prediction of retro- 
viral integration sites with respect to (5). The Sensitivity 
(Recall) increases 5% (to 10%) explaining up to 85% of 
integration sites in CD4-I- T cells. Accordingly, the related 
F score increases 3% (to 7%, Figure 3). The mixture co- 
efficient vector 0, calculated for each chromosome, stores 
the contribution of each mark to the final association. As 
expected, the optimal receipt includes all marks related to 
active transcription and open chromatin, whereas 




P300/ EMdeCODE P300/ CSI-ANN 



Figure 2. Chromosome projection mandalas calculated within 2Kbp for EMdeCODE and CSI-ANN enhancer predictions versus p300 enriched 
regions on chrl. Each dot on the mandala represents the center of a p300 region in polar coordinates. Angular distance corresponds to genomic 
location, and radial distance from the outer circle indicates the log-scaled distance in nucleotides from the closest predicted enhancer. Inner circles 
mark 1, 2 (in blue), 10, 20Kbp and so on. Blue filled circles represent p300 sites within 2Kbp from the nearest predicted enhancer. 
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H3K27me3 and other heterochromatin-related marks give 
no contribution (Supplementary Figure S2). All results 
have been checked using a 5-fold cross-vaHdation 
strategy similar to what presented in (5). 

Discriminating between coding and non-coding DNA 

Another possible application of EMdeCODE is the iden- 
tification and the quantification of differential epigenetic 
profiles between two distinct genomic features. Here, the 
algorithm has been used to identify epigenetic differences 
that could characterize coding from non-coding tran- 
scripts as, in this case, long non-coding RNA (IncRNA). 
It is beheved that non-coding mRNAs transcribed by the 
PolII transcriptional machine have the same epigenetic 
landscape found in promoter regions specific of coding 
mRNA (33). To investigate this hypothesis, a total of 
182 IncRNA, 84 not overlapping with any known 
expressed transcript within 6Kbp upstream of the TSS 
and 98 not overlapping within 6Kbp downstream of the 



3'-end of transcription regions (EoT), all actively 
transcribed in CD4+ T cells [accordingly to publicly avail- 
able RNASeq data (28,34)], with a clear PoUl peak in the 
promoter region [data from (32)] and more than 400 nt 
long have been selected. As a matched control, 84 + 98 
coding PoUl transcribed RefSeq genes with the same 6 
Kbp non-overlapping condition as selected IncRNA and 
with comparable transcriptional level and length have 
been randomly chosen (Supplementary Figure S3). 
EMdeCODE was then fed with the above-mentioned 39 
markers to identify possible discriminating factors over 
promoter regions, gene bodies (Tx) and the 3'-EoT. A 
comparison with 2000 randomly selected genomic 
regions has been performed as reference. 

Diagrams in Figure 4A report F score curves calculated 
for coding versus non-coding genes (Gene versus ncRNA), 
coding versus random (Gene versus Rnd) and non-coding 
versus random (ncRNA versus Rnd), considering gene 
bodies and increasing regions from TSS and 3' EoT 




MLV Helal 



MLVHelall 



MLV CD4-H 



HIVmlNmGAG 



□ santoni et al, (2010) H EMdeCODE 



Figure 3. Histograms of the F score (upper panel) and the percentage of associated proviruses wi2Kbp of the EMdeCODE-generated supermark 
(lower panel) with respect to MLV (I and II) proviruses (41,42) and HIVmlNmGAG chimera (41) integrated in HeLa in comparison with the 
'supermarker' previously reported by Santoni et al. 
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Figure 4. Coding (Gene), ncRNA and Rnd comparisons. (A) Supermarks and corresponding F scores are calculated for gene bodies (Tx) and for 
genomic regions considering regions extending upstream from TSS (negative sizes on .y axis) and downstream of 3'-end of transcription (EoT, 
positive sizes). Association is considered significant if F score >0.6. (B) Single-mark contributions (mixture coefficient vector 0) to supermark 
discriminating Gene versus IncRNA ordered by contribution in gene bodies. Only marks contributing >1% are reported. 



respectively. As expected, both coding genes and non- 
coding RNA (ncRNA) can easily distinguished from 
random sites in gene bodies (Tx, F score = 0.87 and 
0.86), whereas it is more difficult but still possible to 



discriminate between them (Tx, F score = 0.67). PolII 
transcribed IncRNA and coding genes do not differ in 
the epigenetic profile of the promoter region (7^ score 
<0.6), accordingly to what previously reported, but they 
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are clearly distinct up to 6 Kbp downstream the EoT 
{F score = 0.78). Figure 4B reports the contribution of 
each mark to the final supermark (i.e. the mixture coeffi- 
cient vector 0 defined in eq. 2) sorted by their percentage 
in gene bodies. The leading histone mark here is 
H3K4me3, unexpectedly followed by H3R2mel that 
is also the fourth contributor in downstream region after 
the well-known marks H2AZ, H3K36mel and H3K36me3 
and in comparison with random sites (Gene versus Rnd, 
Supplementary Figure S4). No functions in human cells 
have been associated to this mark so far. Notably, 
H3R2mel does not show up in IncRNA versus Rnd 
analysis (Supplementary Figure S5), indicating that this 
mark might be specific to coding regions. To support 
this hypothesis, 1 calculated H3R2mel and PolII density 
plots extending 6 Kbp 5' and 3' gene bodies of the two 
matched dataset. H3R2mel appears to be indeed enriched 
in coding gene bodies with respect to expressed IncRNA. 
Its density increases drastically in the promoter region 
reaching its maximum immediately after TSS and it de- 
creases smoothly but peaking again around the 3' EoT 
region (Figure 5A, black line). Conversely, H3R2niel is 
almost flat in IncRNA regions (Figure 5 A, red fine). This 
different behavior cannot be explained by differences of 
length, expression level or by PolII-mediated activation as 
clearly shown in Figure 5B. Moreover, correlation 
between level of expression of coding genes and IncRNA 
with H3R2mel density in T cell data is rather low: 0.25 
and 0.22, respectively. Interestingly, coding PolII density 
has a 'bump' at 3' EoT [interpreted as a slow release of 
PolII from DNA (35)], absent in IncRNA and that overlap 
with H3R2mel increased density in the same region 
(Figure 5A and B). 



DISCUSSION 

Understanding the combinatorial complexity of histone 
post-translational modifications is important to elucidate 
the mechanisms of gene expression regulation as well 
as protein-DNA interactions. In this perspective, 
EMdeCODE recreates the probability mass distribution 
of observing a specific event on a specific DNA location 
by aggregating ChlPSeq histone marks profiles into a new 
associated supermark. The algorithm is based on classical 
Expectation Maximization approach and mixture 
modehng with statistical selection. The goal is to 
maximize the association in term of F score, thus 
reaching an optimal tradeoff between Precision and 
Sensitivity. 

Compared with previous methods, EMdeCODE signifi- 
cantly improves the identification of putative enhancers, 
as demonstrated by the good approximation of the experi- 
mental p300 distribution in CD4+ T cells. One of the 
reasons for this increased performance can be ascribed 
to the choice of reconstructing a minimal profile by con- 
sidering only significant peaks from ChlPSeq data, thus 
consistently reducing the associated noise and defining 
a common background for all the available marks. 
Moreover, the interpretation of ChlPSeq profiles as prob- 
ability mass densities and the choice of the F0.5 score as 



H3R2mel 




Coding Genes 
IncRNA 



TSS 20% 40% 60% 80% EoT 



+5kB 




-6kB 



TSS 20% 40% 60% 80% EoT 



+ 6kB 



c 

100 





H3R2mel 


Coding Genes 
IncRNA 




Pol" 


V: 



3kB 



TSS 20% 40% 60% 80% EoT 



+6kB 



Figure 5. H3R2mel (A) and PolII (B) read density plots of coding 
(black) and non-coding (red) gene bodies normalized by length and 
divided in 50 non-overlapping bins +15 bins for ±6 Kbp extensions. 
(C) Density plots in untranscribed genes. 



cost function increase the contribution of less abundant 
but significantly associated marks and, at the same time, 
reduce the number of false positives. 

When applied to the prediction of retroviral integration 
sites, EMdeCODE outperforms the heuristic method 
reported in (5) and generates a supermark localized 
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within 2 Kbp of 75-85% of gammaretroviral proviruses, 
while increasing the corresponding F score. It is worth to 
observe that the best performance is obtained in CD4+ T 
cells where the highest number of marks is also available. 
This again demonstrates that EMdeCODE exploits and 
properly balances even minor contributions that are 
underrated by the heuristic approach. 

EMdeCODE can also be used to determine whether 
some marks can be used to discriminate among genomic 
events. Indeed, it has been capable of determining 
supermarks which distinguish coding from non-coding 
genes, despite substantial epigenetic homogeneity of tran- 
scriptional regions having been previously reported (36). 
This seems to be true only for the promoter region up to 
6 Kbp (upstream from the TSS), where no significant dif- 
ferences can be observed. Regions downstream (from 2 to 
6 Kbp) of coding genes, instead, are particularly enriched 
in transcription associated marks, such as H2AZ, 
H3K4me3 and H3K36mel-3 among others, likely ac- 
counting for a complex regulatory structure that is less 
pronounced for most of non-coding transcripts. The 
gene body, also, is clearly enriched of H3K36me3 and 
H3K36mel in coding genes, especially in comparison 
with random sites (Supplementary Figure S4), in line 
with what has been previously reported (37). Perhaps 
less expected is the strong influence of H3R2mel in 
coding gene bodies with respect to IncRNA and random 
regions as shown in Figure 4B and Supplementary Figure 
S4. The specificity of this mark for coding regions is con- 
firmed by its absence in IncRNA compared with random 
regions even if they can be clearly discriminated (F score 
>0.75, Figure 4A and Supplementary Figure S5). 
H3R2mel significantly contributes to coding genes char- 
acterization in 3' EoT regions and 6 Kbp downstream. 
H3R2mel read density profile shows a pecuHar shape in 
coding regions of the matched dataset by markedly 
outUning the gene body. Conversely, H3R2mel is consid- 
erably depleted in non-coding gene bodies, explaining why 
it has been picked up by EMdeCODE (Figure 5A). H3R2 
is methylated by PRMTl and CARMl that subsequently 
coactivate nuclear hormone receptor-mediated transcrip- 
tion, suggesting that the H3R2 methylation may be 
involved in transcriptional activity (38), as recently 
observed in Drosophila (39). On the other hand, 
H3R2mel is not particularly enriched in active promoter 
regions of human DNA (32) and it is not strongly 
correlated with expression levels [see Results and (32)]. 
Interestingly, a recent study proposed that symmetrical 
dimethylation of H3R2 may have a role in euchromatin 
maintenance by mediating histone H3K4 methylation and 
H3 and H4 acetylation (40). Figure 5C reports PolII and 
H3R2mel densities of 150 untranscribed 6 Kbp 
non-overlapping coding and non-coding genes. Again, 
monomethylated arginin of histone 3 is denser in coding 
regions even in complete absence of transcriptional 
activity (flat PolII density). These observations together 
with the other results obtained by EMdeCODE support 
the hypothesis that H3R2 is a key amino acid residue 
epigenetically involved in the maintenance and the protec- 
tion of coding regions. 
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